
Conversation

Collaborator

@avirajsingh7 avirajsingh7 commented Dec 10, 2025

Summary

Target issues are #557 & #547

This change refactors the evaluation run process to utilize a stored configuration instead of a configuration dictionary. It introduces fields for config_id, config_version, and model in the evaluation run table, streamlining the evaluation process and improving data integrity.

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensured that it is tested and has test cases.

Summary by CodeRabbit

  • Refactor
    • Evaluation runs now reference stored configs by ID and version; model names are resolved at runtime from those stored configs.
  • New Features
    • Runtime resolution of model name from stored evaluation config.
    • Default embedding model standardized to "text-embedding-3-large".
  • Behavior
    • Dataset upload/list endpoints now operate at /evaluations/datasets.
  • Chores
    • DB migration adds config_id and config_version to evaluation runs.
  • API
    • Evaluation creation accepts config_id and config_version instead of inline config.
  • Tests
    • Tests updated to use the stored-config workflow and new payloads.



coderabbitai bot commented Dec 10, 2025

📝 Walkthrough


Replaces embedded evaluation config dicts with stored references (config_id: UUID, config_version: int), adds resolver functions to fetch/resolve stored config blobs and derive model names, hard-codes embedding model to text-embedding-3-large, updates DB migration and tests, and re-exports the new resolver functions.
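
For orientation, here is a minimal sketch of how a caller might use the two new resolvers. The resolver names, signatures, and module paths are taken from the walkthrough above and the table below; the wrapper functions, the guard clause, and the error message are illustrative assumptions rather than the PR's actual code.

# Hedged sketch, not the PR's code: resolver names/signatures and module paths come
# from the walkthrough; the wrappers, guard, and messages here are assumptions.
from uuid import UUID

from sqlmodel import Session

from app.crud.evaluations import resolve_evaluation_config, resolve_model_from_config
from app.models.evaluation import EvaluationRun


def model_for_run(session: Session, eval_run: EvaluationRun) -> str:
    """Derive the completion model name from the eval run's stored config reference."""
    if eval_run.config_id is None or eval_run.config_version is None:
        raise ValueError(f"Evaluation run {eval_run.id} has no stored config reference")
    return resolve_model_from_config(session, eval_run)


def config_for_run(
    session: Session, config_id: UUID, config_version: int, project_id: int
):
    """Fetch and resolve the stored config blob; returns a (config, error) tuple."""
    return resolve_evaluation_config(
        session=session,
        config_id=config_id,
        config_version=config_version,
        project_id=project_id,
    )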

Changes

Cohort / File(s) Summary
Data Models
backend/app/models/evaluation.py
Replaced embedded config: dict with config_id: UUID and config_version: int on EvaluationRun and EvaluationRunPublic (nullable, version ≥ 1).
Evaluation CRUD Core
backend/app/crud/evaluations/core.py
create_evaluation_run now accepts/stores config_id & config_version; added resolve_evaluation_config(session, config_id, config_version, project_id) and resolve_model_from_config(session, eval_run) using ConfigVersionCrud + resolve_config_blob; added UUID/LLM imports, logging and docstring updates.
Embedding CRUD
backend/app/crud/evaluations/embeddings.py
Added EMBEDDING_MODEL = "text-embedding-3-large" constant; embedding flows use this constant instead of reading model from eval run config.
Processing CRUD
backend/app/crud/evaluations/processing.py
Replaced direct eval_run.config access with resolve_model_from_config(session, eval_run) and pass resolved model into Langfuse dataset run creation.
Batch CRUD
backend/app/crud/evaluations/batch.py
Switched build_evaluation_jsonl / start_evaluation_batch to accept KaapiLLMParams; JSONL builder extracts explicit fields, skips items without question, and stores batch config via model_dump(exclude_none=True).
Service: Evaluations
backend/app/services/evaluations/evaluation.py
start_evaluation signature changed to accept config_id & config_version; resolves stored config via ConfigVersionCrud/resolve_config_blob, enforces provider == OPENAI, uses resolved completion params for batch start; removed build_evaluation_config.
LLM Provider Registry
backend/app/services/llm/providers/registry.py
Added LLMProvider.OPENAI = "openai" and registered it in the provider mapping.
Module Exports
backend/app/crud/evaluations/__init__.py, backend/app/services/evaluations/__init__.py
Re-exported resolve_evaluation_config and resolve_model_from_config; removed build_evaluation_config re-export from services evaluations init.
API Routes
backend/app/api/routes/evaluations/evaluation.py, backend/app/api/routes/evaluations/dataset.py
evaluate() endpoint payload now requires config_id (UUID) and config_version (int) instead of inline config (see the request example after this table); some dataset route decorators changed from "/" to "".
DB Migration
backend/app/alembic/versions/041_add_config_in_evals_run_table.py
Migration adds config_id (UUID FK) and config_version (Integer) to evaluation_run, drops old config JSONB; downgrade restores previous schema.
Tests
backend/app/tests/...
Tests updated to use create_test_config and reference config_id/config_version; replaced inline configs with persisted config references and KaapiLLMParams.
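
To make the new request shape concrete, a hedged example in the style of the updated tests follows. The payload keys config_id and config_version, and the 201 response carrying them back, come from this PR; the route path, fixture names, and assertions are assumptions.

# Hedged example, not copied from the PR's test suite: fixture names, the exact route
# path, and the assertions are assumptions; config_id / config_version replacing the
# old inline "config" dict is what this PR introduces.
from fastapi.testclient import TestClient


def test_start_evaluation_with_stored_config(
    client: TestClient,
    user_api_key_header: dict[str, str],
    test_dataset,   # EvaluationDataset fixture (assumed)
    test_config,    # stored config created via create_test_config (assumed shape)
) -> None:
    payload = {
        "dataset_id": str(test_dataset.id),
        "config_id": str(test_config.id),
        "config_version": test_config.version,  # derived from the config, not hardcoded
    }
    response = client.post("/evaluations", json=payload, headers=user_api_key_header)
    assert response.status_code == 201
    body = response.json()
    assert body["config_id"] == payload["config_id"]
    assert body["config_version"] == payload["config_version"]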

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client/API
  participant API as Evaluations API
  participant Service as Evaluations Service
  participant DB as Config DB
  participant Resolver as Config Blob Resolver
  participant Batch as Batch Orchestrator

  Client->>API: POST /evaluations (dataset_id, config_id, config_version, ...)
  API->>Service: start_evaluation(dataset_id, config_id, config_version, ...)
  Service->>DB: ConfigVersionCrud.get_by_id(config_id, config_version)
  DB-->>Service: ConfigVersion record (with blob)
  Service->>Resolver: resolve_config_blob(config_version.blob)
  Resolver-->>Service: LLMCallConfig (completion.params, model)
  Service->>Service: validate provider == OPENAI
  Service->>Batch: start_evaluation_batch(resolved_completion_params, model)
  Batch-->>Service: batch run created / status
  Service-->>API: 201 Created (EvaluationRun with config_id/config_version)
  API-->>Client: 201 Created (response)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • avirajsingh7
  • Prajna1999

Poem

🐇 I hopped from dicts to IDs with glee,
I fetched the blobs to find the model key.
One embedding tune, one migration moon,
Tests leapt forward — I munched a carrot soon. 🥕

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 58.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title "Evaluation: Use Config Management" clearly and concisely summarizes the main change: refactoring evaluation runs to use stored configuration management instead of inline config dictionaries.



@avirajsingh7 avirajsingh7 linked an issue Dec 10, 2025 that may be closed by this pull request
@avirajsingh7 avirajsingh7 self-assigned this Dec 10, 2025
@avirajsingh7 avirajsingh7 added enhancement New feature or request ready-for-review labels Dec 10, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
backend/app/crud/evaluations/embeddings.py (1)

366-367: Misleading comment - update to reflect actual behavior.

The comment says "Get embedding model from config" but the code hardcodes the value. Update the comment to accurately describe the implementation.

-        # Get embedding model from config (default: text-embedding-3-large)
-        embedding_model = "text-embedding-3-large"
+        # Use fixed embedding model (text-embedding-3-large)
+        embedding_model = "text-embedding-3-large"
backend/app/tests/api/routes/test_evaluation.py (1)

524-545: Consider renaming function to match its new purpose.

The function test_start_batch_evaluation_missing_model was repurposed to test invalid config_id scenarios. The docstring was updated but the function name still references "missing_model". Consider renaming for clarity.

-    def test_start_batch_evaluation_missing_model(self, client, user_api_key_header):
-        """Test batch evaluation fails with invalid config_id."""
+    def test_start_batch_evaluation_invalid_config_id(self, client, user_api_key_header):
+        """Test batch evaluation fails with invalid config_id."""
backend/app/api/routes/evaluation.py (1)

499-510: Consider validating that model is present in config params.

The model is extracted with .get("model") which returns None if not present. Since model is critical for cost tracking (used in create_langfuse_dataset_run), consider validating its presence and returning an error if missing.

     # Extract model from config for storage
     model = config.completion.params.get("model")
+    if not model:
+        raise HTTPException(
+            status_code=400,
+            detail="Config must specify a 'model' in completion params for evaluation",
+        )

     # Create EvaluationRun record with config references
backend/app/crud/evaluations/core.py (1)

15-69: Config-based create_evaluation_run refactor is correctly implemented; consider logging model for improved traceability.

The refactor from inline config dict to config_id: UUID and config_version: int is properly implemented throughout:

  • The sole call site in backend/app/api/routes/evaluation.py:503 correctly passes all new parameters with the right types (config_id as UUID, config_version as int, model extracted from config).
  • The EvaluationRun model in backend/app/models/evaluation.py correctly defines all three fields with appropriate types and descriptions.
  • All type hints align with Python 3.11+ guidelines.

One suggested improvement for debugging:

Include model in the creation log for better traceability when correlating evaluation runs with model versions:

logger.info(
    f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-   f"config_id={config_id}, config_version={config_version}"
+   f"config_id={config_id}, config_version={config_version}, model={model}"
)

Since the model is already extracted at the call site and passed to the function, including it in the log will provide fuller context for operational debugging without any additional cost.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30ef268 and d5f9d4d.

📒 Files selected for processing (7)
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py (1 hunks)
  • backend/app/api/routes/evaluation.py (5 hunks)
  • backend/app/crud/evaluations/core.py (5 hunks)
  • backend/app/crud/evaluations/embeddings.py (1 hunks)
  • backend/app/crud/evaluations/processing.py (1 hunks)
  • backend/app/models/evaluation.py (3 hunks)
  • backend/app/tests/api/routes/test_evaluation.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/api/routes/evaluation.py
  • backend/app/models/evaluation.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Expose FastAPI REST endpoints under backend/app/api/ organized by domain

Files:

  • backend/app/api/routes/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
backend/app/crud/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement database access operations in backend/app/crud/

Files:

  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (2)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-130)
  • EvaluationRun (133-248)
backend/app/crud/evaluations/processing.py (1)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (20-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (3)
backend/app/crud/evaluations/processing.py (1)

257-264: LGTM! Clean refactor to use stored model field.

The change correctly retrieves the model from eval_run.model instead of extracting it from config. This aligns with the new data model where the model is snapshotted at evaluation creation time.

backend/app/models/evaluation.py (1)

148-157: LGTM! Well-structured config reference fields.

The new config_id and config_version fields properly establish the relationship to stored configs with appropriate constraints (ge=1 for version). The nullable design allows backward compatibility with existing data.

backend/app/api/routes/evaluation.py (1)

478-495: LGTM! Robust config resolution with provider validation.

The config resolution flow properly validates that the stored config exists and uses the OPENAI provider. Error handling returns appropriate HTTP 400 responses with descriptive messages.


codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 85.07463% with 10 lines in your changes missing coverage. Please review.

Files with missing lines (patch % | missing lines):
  • backend/app/services/evaluations/evaluation.py: 25.00% | 6 missing ⚠️
  • backend/app/crud/evaluations/core.py: 87.50% | 2 missing ⚠️
  • backend/app/crud/evaluations/batch.py: 83.33% | 1 missing ⚠️
  • backend/app/crud/evaluations/embeddings.py: 50.00% | 1 missing ⚠️


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/app/models/evaluation.py (1)

148-158: Align EvaluationRun type hints with nullable DB columns for config fields

config_id and config_version are nullable in the schema but annotated as non-optional types. This can mislead callers and type checkers into assuming they’re always present, even for legacy runs or transitional data.

Consider updating the annotations to reflect nullability:

-    config_id: UUID = SQLField(
+    config_id: UUID | None = SQLField(
         foreign_key="config.id",
         nullable=True,
         description="Reference to the stored config used for this evaluation",
     )
-    config_version: int = SQLField(
+    config_version: int | None = SQLField(
         nullable=True,
         ge=1,
         description="Version of the config used for this evaluation",
     )

This keeps the schema the same while making runtime and type expectations clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5f9d4d and eda7762.

📒 Files selected for processing (1)
  • backend/app/models/evaluation.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/models/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/models/evaluation.py (1)

271-273: Public model nullability now matches the schema

Making config_id, config_version, and model nullable in EvaluationRunPublic correctly reflects the DB fields and avoids validation issues for existing rows. This resolves the earlier mismatch between the table and the public model.

Collaborator

@Prajna1999 Prajna1999 left a comment

lgtm

Collaborator Author

hold merge until the frontend is ready.

Collaborator

good to go. Can be merged

@AkhileshNegi AkhileshNegi linked an issue Jan 24, 2026 that may be closed by this pull request
AkhileshNegi and others added 3 commits January 24, 2026 13:17
…sion

- Resolved merge conflicts with main branch (ef56025 refactor)
- Updated evaluation routes and services to use config_id/config_version
  instead of inline config dict
- Deleted old evaluation.py route (replaced by new evaluations/ directory)
- Updated tests to use config_id/config_version pattern
- Kept resolve_model_from_config function for cron job processing

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/core.py`:
- Around line 66-69: The logger.info call that reports the created EvaluationRun
(the one referencing eval_run.id, run_name, config_id, config_version) must be
prefixed with the function name tag; update the log message in
create_evaluation_run to start with "[create_evaluation_run]" so the entry reads
like "[create_evaluation_run] Created EvaluationRun record: id=...,
run_name=..., config_id=..., config_version=...". Ensure you only change the log
string (the logger.info call) and keep the existing variables (eval_run,
run_name, config_id, config_version) intact.
♻️ Duplicate comments (3)
backend/app/services/llm/providers/registry.py (1)

16-24: Handle the new OPENAI alias in get_llm_provider.

Lines 16-24 register "openai", but lines 66-74 only accept OPENAI_NATIVE, so provider_type="openai" will raise a ValueError despite being registered. This makes the new provider unusable.

🔧 Proposed fix
-    if provider_type == LLMProvider.OPENAI_NATIVE:
+    if provider_type in (LLMProvider.OPENAI_NATIVE, LLMProvider.OPENAI):
         if "api_key" not in credentials:
             raise ValueError("OpenAI credentials not configured for this project.")
         client = OpenAI(api_key=credentials["api_key"])

Also applies to: 66-74

backend/app/crud/evaluations/batch.py (1)

107-119: Avoid sending None fields in the Responses API payload.

Lines 107-119 always include optional fields and always build tools, so vector_store_ids=None is sent when no knowledge base is configured. This can cause request failures. Build the body conditionally and omit None values.

🔧 Proposed fix (conditional body)
-        batch_request = {
-            "custom_id": item["id"],
-            "method": "POST",
-            "url": "/v1/responses",
-            "body": {
-                # Use config as-is
-                "model": config.model,
-                "instructions": config.instructions,
-                "temperature": config.temperature,
-                "reasoning": {"effort": config.reasoning} if config.reasoning else None,
-                "tools": [
-                    {
-                        "type": "file_search",
-                        "vector_store_ids": config.knowledge_base_ids,
-                        "max_num_results": config.max_num_results or 20,
-                    }
-                ],
-                "input": question,  # Add input from dataset
-            },
-        }
+        body: dict[str, Any] = {"model": config.model, "input": question}
+        if config.instructions is not None:
+            body["instructions"] = config.instructions
+        if config.temperature is not None:
+            body["temperature"] = config.temperature
+        if config.reasoning is not None:
+            body["reasoning"] = {"effort": config.reasoning}
+        if config.knowledge_base_ids:
+            body["tools"] = [
+                {
+                    "type": "file_search",
+                    "vector_store_ids": config.knowledge_base_ids,
+                    "max_num_results": config.max_num_results or 20,
+                }
+            ]
+
+        batch_request = {
+            "custom_id": item["id"],
+            "method": "POST",
+            "url": "/v1/responses",
+            "body": body,
+        }
Open question for the OpenAI Responses API: are these optional fields allowed to be null, and must tools.vector_store_ids be a non-null list?
backend/app/crud/evaluations/core.py (1)

343-366: Validate completion params and model before returning.

Line 366 dereferences config.completion.params.model without checking if completion/params exist or if model is empty, which can raise or violate the declared str return contract.

🔧 Proposed fix
-    model = config.completion.params.model
-    return model
+    if not config.completion or not config.completion.params:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} is missing completion params "
+            f"(config_id={eval_run.config_id}, version={eval_run.config_version})"
+        )
+
+    model = config.completion.params.model
+    if not model:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} has no model specified "
+            f"(config_id={eval_run.config_id}, version={eval_run.config_version})"
+        )
+    return model
🧹 Nitpick comments (2)
backend/app/services/evaluations/evaluation.py (1)

101-114: Add explicit None check for type safety.

After resolve_config_blob returns, the type of config is ConfigBlob | None. While the function contract implies config is non-None when error is None, accessing config.completion.provider on line 110 without an explicit check could cause issues with type checkers and is less defensive.

🛠️ Suggested improvement
     config, error = resolve_config_blob(
         config_crud=config_version_crud,
         config=LLMCallConfig(id=config_id, version=config_version),
     )
-    if error:
+    if error or config is None:
         raise HTTPException(
             status_code=400,
             detail=f"Failed to resolve config from stored config: {error}",
         )
-    elif config.completion.provider != LLMProvider.OPENAI:
+
+    if config.completion.provider != LLMProvider.OPENAI:
         raise HTTPException(
             status_code=422,
             detail="Only 'openai' provider is supported for evaluation configs",
         )
backend/app/tests/api/routes/test_evaluation.py (1)

554-577: Test name is inconsistent with test behavior after refactor.

The test method is named test_start_batch_evaluation_missing_model but the docstring and actual test behavior validate that the evaluation fails with an invalid/non-existent config_id. Consider renaming for clarity.

🛠️ Suggested rename
-    def test_start_batch_evaluation_missing_model(
+    def test_start_batch_evaluation_invalid_config_id(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
         """Test batch evaluation fails with invalid config_id."""

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
backend/app/tests/crud/evaluations/test_processing.py (6)

260-291: Add a type hint for test_dataset.

test_dataset is untyped in this fixture. The guidelines require type hints on all params/returns. As per coding guidelines, please annotate the fixture parameter.

🔧 Proposed fix
-    def eval_run_with_batch(self, db: Session, test_dataset) -> EvaluationRun:
+    def eval_run_with_batch(
+        self, db: Session, test_dataset: EvaluationDataset
+    ) -> EvaluationRun:

402-418: Add missing type hints and return type.

This async test lacks a type hint for test_dataset and a return type. As per coding guidelines, please annotate both.

🔧 Proposed fix
-    async def test_process_completed_evaluation_no_batch_job_id(
-        self, db: Session, test_dataset
-    ):
+    async def test_process_completed_evaluation_no_batch_job_id(
+        self, db: Session, test_dataset: EvaluationDataset
+    ) -> None:

459-491: Add a type hint for test_dataset.

This fixture parameter is untyped. As per coding guidelines, please add the type hint.

🔧 Proposed fix
-    def eval_run_with_embedding_batch(self, db: Session, test_dataset) -> EvaluationRun:
+    def eval_run_with_embedding_batch(
+        self, db: Session, test_dataset: EvaluationDataset
+    ) -> EvaluationRun:

606-648: Add missing type hints and return type.

test_dataset is untyped and the return type is missing. As per coding guidelines, please annotate both.

🔧 Proposed fix
-    async def test_check_and_process_evaluation_completed(
+    async def test_check_and_process_evaluation_completed(
         self,
         mock_process,
         mock_poll,
         mock_get_batch,
         db: Session,
-        test_dataset,
-    ):
+        test_dataset: EvaluationDataset,
+    ) -> None:

675-717: Add missing type hints and return type.

test_dataset is untyped and the return type is missing. As per coding guidelines, please annotate both.

🔧 Proposed fix
-    async def test_check_and_process_evaluation_failed(
+    async def test_check_and_process_evaluation_failed(
         self,
         mock_poll,
         mock_get_batch,
         db: Session,
-        test_dataset,
-    ):
+        test_dataset: EvaluationDataset,
+    ) -> None:

779-822: Add missing type hints and return type.

test_dataset is untyped and the return type is missing. As per coding guidelines, please annotate both.

🔧 Proposed fix
-    async def test_poll_all_pending_evaluations_with_runs(
+    async def test_poll_all_pending_evaluations_with_runs(
         self,
         mock_langfuse_client,
         mock_openai_client,
         mock_check,
         db: Session,
-        test_dataset,
-    ):
+        test_dataset: EvaluationDataset,
+    ) -> None:
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

555-578: Consider renaming the test for clarity.

The test now validates invalid config_id, but the name still says “missing_model”. Renaming improves intent clarity.

♻️ Suggested rename
-def test_start_batch_evaluation_missing_model(
+def test_start_batch_evaluation_invalid_config_id(
     self, client: TestClient, user_api_key_header: dict[str, str]
 ) -> None:

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/embeddings.py`:
- Around line 368-369: The comment above the assignment to embedding_model is
misleading—change it to state that the model is a module-level constant rather
than being read from configuration; update the comment referencing
EMBEDDING_MODEL (used when assigning embedding_model = EMBEDDING_MODEL) to
something like "Use module-level EMBEDDING_MODEL constant" so it accurately
describes the behavior.
🧹 Nitpick comments (1)
backend/app/crud/evaluations/embeddings.py (1)

24-26: Consider adding an explicit type annotation to the constant.

Per the coding guidelines recommending type hints throughout the codebase, adding an explicit type annotation improves clarity and tooling support.

Suggested change
 # Default embedding model
-EMBEDDING_MODEL = "text-embedding-3-large"
+EMBEDDING_MODEL: str = "text-embedding-3-large"

@ProjectTech4DevAI ProjectTech4DevAI deleted a comment from coderabbitai bot Jan 26, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/services/evaluations/evaluation.py`:
- Around line 94-111: The current flow assumes resolve_evaluation_config
returned a fully populated object and dereferences config.completion.provider,
which can be None/missing and cause AttributeError; update the handling after
resolve_evaluation_config (the block that checks error and then inspects
config.completion.provider) to first verify config is not None and that config
has a completion attribute with the expected fields (e.g., config.completion and
config.completion.provider and config.completion.params) before accessing
provider; if any are missing, raise an HTTPException with a 400 (or 422 as
appropriate) and a clear detail string like "Incomplete evaluation config:
missing completion/provider/params" so unresolved/partial configs return
controlled 4xx errors instead of a 500.
♻️ Duplicate comments (1)
backend/app/crud/evaluations/core.py (1)

372-392: Validate completion params/model before returning.

config.completion.params.model can be missing or None, which violates the declared str return type and can break downstream calls. This mirrors a previously flagged issue.

🔧 Proposed fix
-    if not eval_run.config_id or not eval_run.config_version:
+    if eval_run.config_id is None or eval_run.config_version is None:
         raise ValueError(
             f"Evaluation run {eval_run.id} has no config reference "
             f"(config_id={eval_run.config_id}, config_version={eval_run.config_version})"
         )
@@
-    model = config.completion.params.model
-    return model
+    if not config.completion or not config.completion.params:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} is missing completion params"
+        )
+
+    model = config.completion.params.model
+    if not model:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} does not specify a model "
+            f"(config_id={eval_run.config_id}, version={eval_run.config_version})"
+        )
+    return model
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

529-544: Prefer deriving config_version from the created config.

Hardcoding config_version=1 (e.g., Line 543, Line 767, Line 807, Line 868, Line 906) can drift if create_test_config changes its default versioning. Consider using the version exposed by the created config object to keep tests resilient, as in the sketch below.

Also applies to: 759-768, 799-808, 862-868, 899-906
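
A hedged sketch of the suggested pattern; create_test_config's arguments and the attribute names on the object it returns are assumptions:

# Hedged sketch: create_test_config's signature and returned attributes are assumed.
config = create_test_config(db, project_id=project.id)
payload = {
    "dataset_id": str(dataset.id),
    "config_id": str(config.id),
    "config_version": config.version,  # derived from the created config, not hardcoded to 1
}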

Comment on lines +94 to +111
    # Step 2: Resolve config from stored config management
    config, error = resolve_evaluation_config(
        session=session,
        config=config,
        assistant_id=assistant_id,
        config_id=config_id,
        config_version=config_version,
        project_id=project_id,
    )
    if error:
        raise HTTPException(
            status_code=400,
            detail=f"Failed to resolve config from stored config: {error}",
        )
    elif config.completion.provider != LLMProvider.OPENAI:
        raise HTTPException(
            status_code=422,
            detail="Only 'openai' provider is supported for evaluation configs",
        )


⚠️ Potential issue | 🟠 Major

Guard against unresolved/partial config before dereferencing.

resolve_evaluation_config can return config=None or a blob missing completion/params. The current flow dereferences those fields, which can crash with AttributeError and return a 500 instead of a controlled 4xx.

🔧 Proposed fix
-    config, error = resolve_evaluation_config(
+    config, error = resolve_evaluation_config(
         session=session,
         config_id=config_id,
         config_version=config_version,
         project_id=project_id,
     )
-    if error:
+    if error or config is None:
         raise HTTPException(
             status_code=400,
             detail=f"Failed to resolve config from stored config: {error}",
         )
-    elif config.completion.provider != LLMProvider.OPENAI:
+    if not config.completion or not config.completion.params:
+        raise HTTPException(
+            status_code=422,
+            detail="Config is missing completion parameters required for evaluation",
+        )
+    if config.completion.provider != LLMProvider.OPENAI:
         raise HTTPException(
             status_code=422,
             detail="Only 'openai' provider is supported for evaluation configs",
         )
🤖 Prompt for AI Agents
In `@backend/app/services/evaluations/evaluation.py` around lines 94 - 111, The
current flow assumes resolve_evaluation_config returned a fully populated object
and dereferences config.completion.provider, which can be None/missing and cause
AttributeError; update the handling after resolve_evaluation_config (the block
that checks error and then inspects config.completion.provider) to first verify
config is not None and that config has a completion attribute with the expected
fields (e.g., config.completion and config.completion.provider and
config.completion.params) before accessing provider; if any are missing, raise
an HTTPException with a 400 (or 422 as appropriate) and a clear detail string
like "Incomplete evaluation config: missing completion/provider/params" so
unresolved/partial configs return controlled 4xx errors instead of a 500.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/041_add_config_in_evals_run_table.py`:
- Line 19: Add Python type hints to the migration functions: update the
signature of upgrade to include a return type (e.g., change def upgrade() to def
upgrade() -> None) and do the same for the corresponding downgrade function (def
downgrade() -> None), ensuring both functions have explicit type hints for
parameters (none) and return values to satisfy the codebase typing guidelines.
♻️ Duplicate comments (2)
backend/app/alembic/versions/041_add_config_in_evals_run_table.py (2)

19-41: Prevent data loss when dropping evaluation_run.config.

Line 41 drops the old JSONB config without migrating existing data into config_id/config_version, which will permanently lose configuration data for existing rows. Please add a data migration/backfill (or explicitly document why data loss is acceptable) before dropping the column.


45-53: Downgrade will fail if rows exist due to non-nullable config.

op.add_column(..., nullable=False) will fail on a non-empty table. Make it nullable (or add a default/backfill before enforcing NOT NULL).

✅ Suggested fix
     op.add_column(
         "evaluation_run",
         sa.Column(
             "config",
             postgresql.JSONB(astext_type=sa.Text()),
             autoincrement=False,
-            nullable=False,
+            nullable=True,
             comment="Evaluation configuration (model, instructions, etc.)",
         ),
     )

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/batch.py`:
- Around line 103-110: The body dict currently always sets "instructions":
config.instructions which can send a None value; update the code that constructs
the request body (the body dict) to only add the "instructions" key when
config.instructions is not None, following the same conditional pattern used for
"reasoning" and "tools" (i.e., check config.instructions and only include the
field when present) so that the API never receives "instructions": None.
🧹 Nitpick comments (2)
backend/app/crud/evaluations/batch.py (2)

117-117: Redundant length check.

The expression config.knowledge_base_ids and len(config.knowledge_base_ids) > 0 is redundant. A non-empty list is truthy, so if config.knowledge_base_ids: suffices.

♻️ Proposed simplification
-        if config.knowledge_base_ids and len(config.knowledge_base_ids) > 0:
+        if config.knowledge_base_ids:

154-156: Update docstring to reflect the new type.

The docstring describes config as a "dict with llm, instructions, vector_store_ids" but it is now KaapiLLMParams with fields model, instructions, knowledge_base_ids, etc.

📝 Proposed docstring update
-        config: Evaluation configuration dict with llm, instructions, vector_store_ids
+        config: KaapiLLMParams with model, instructions, knowledge_base_ids, etc.

Collaborator

@kartpop kartpop left a comment

approved with comments

@ProjectTech4DevAI ProjectTech4DevAI deleted a comment from coderabbitai bot Jan 27, 2026

 def build_evaluation_jsonl(
-    dataset_items: list[dict[str, Any]], config: dict[str, Any]
+    dataset_items: list[dict[str, Any]], config: KaapiLLMParams

To make sure the Text, TTS, and STT configs work, KaapiLLMParams looks like this in the STT PR:

class TextLLMParams(SQLModel):
    model: str
    instructions: str | None = Field(
        default=None,
    )
    knowledge_base_ids: list[str] | None = Field(
        default=None,
        description="List of vector store IDs to use for knowledge retrieval",
    )
    reasoning: Literal["low", "medium", "high"] | None = Field(
        default=None,
        description="Reasoning configuration or instructions",
    )
    temperature: float | None = Field(
        default=None,
        ge=0.0,
        le=2.0,
    )
    max_num_results: int | None = Field(
        default=None,
        ge=1,
        description="Maximum number of candidate results to return",
    )


class STTLLMParams(SQLModel):
    model: str
    instructions: str
    input_language: str | None = None
    output_language: str | None = None
    response_format: Literal["text"] | None = Field(
        None,
        description="Can take multiple response_format like text, json, verbose_json.",
    )
    temperature: float | None = Field(
        default=0.2,
        ge=0.0,
        le=2.0,
    )


class TTSLLMParams(SQLModel):
    model: str
    voice: str
    language: str
    response_format: Literal["mp3", "wav", "ogg"] | None = "wav"
    speed: float | None = Field(None, ge=0.25, le=4.0)


KaapiLLMParams = Union[TextLLMParams, STTLLMParams, TTSLLMParams]

It is a Union of all these related Pydantic models. Maybe we can use Any for now to avoid forcing type safety, and plan accordingly once the STT PR gets merged.

     session: Session,
     eval_run: EvaluationRun,
-    config: dict[str, Any],
+    config: KaapiLLMParams,

ditto as above


 class LLMProvider:
     OPENAI_NATIVE = "openai-native"
+    OPENAI = "openai"

By design, this LLMProvider takes only OPENAI_NATIVE and GOOGLE_NATIVE so the flow stays uniform from the top. Adding OPENAI as well will create issues with the unified API. Let's remove this enum value for now; once the new STT PR gets merged, this extra field will probably not be required.


will take this up in the next PR

@AkhileshNegi AkhileshNegi merged commit 6ac9391 into main Jan 28, 2026
2 of 3 checks passed
@AkhileshNegi AkhileshNegi deleted the evals/config_addition branch January 28, 2026 06:55
Successfully merging this pull request may close these issues: Config Management: Evaluation; Add config management in Evals.